-
Notifications
You must be signed in to change notification settings - Fork 680
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
design:add outlier detection design doc #5460
base: main
Are you sure you want to change the base?
design:add outlier detection design doc #5460
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #5460 +/- ##
=======================================
Coverage 78.53% 78.53%
=======================================
Files 138 138
Lines 19327 19327
=======================================
Hits 15179 15179
Misses 3856 3856
Partials 292 292 |
The Contour project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to the #contour channel in the Kubernetes Slack |
@skriss @sunjayBhatia |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the design! sorry to take so long in reviewing but left some comments from a first pass
design/outlier-detection-design.md
Outdated
// +optional | ||
// +kubebuilder:validation:Maximum=100 | ||
// +kubebuilder:default=0 | ||
MinHealthPercent *uint32 `json:"minHealthPercent,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if we need to offer configurability for panic threshold in this case since we have the above "max ejection %" config and we are not utilizing the panic threshold ability to "fail" traffic when in panic mode, is there a specific reason you have for adding this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is a reference to some of the design concepts of istio. After careful consideration, I think we can remove this field and keep the default value of contour.Thanks for your comment
|
||
## Open Issues | ||
- Whether the existing supported configuration options meet the needs of the user. | ||
- Should a global switch be provided to record [ejection event logs][2]? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm in favor of adding some discussion and exploration on this, I think with our active healthchecking configuration, it isn't ideal as-is but (IMO) people find it generally easier to explain and understand why their upstream is unhealthy (it returned an unhealthy status code from their healthcheck endpoint)
with passive healthchecking I can foresee some support requests etc. where users are confused as to why their services are deemed unhealthy, so having logs etc. to explain things would be great
we could also take advantage of the active healthchecking event log configuration as well
I think a great outcome of implementing this feature could be a debugging guide, including some mention of what stats to monitor (e.g. from https://www.envoyproxy.io/docs/envoy/latest/configuration/upstream/cluster_manager/cluster_stats#config-cluster-manager-cluster-stats-outlier-detection)
we would have to likely add a bootstrap flag to change the bootstrap ClusterManager
configuration to enable logging
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, adding a bootstrap flag to handle whether to enable logging can better help people confirm the cause of the failure. If it is considered feasible, I will create a new PR to implement this function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can also add an active health check flag to record health check failure events
// Defaults to false. | ||
// +optional | ||
// +kubebuilder:default=false | ||
SplitExternalLocalOriginErrors bool `json:"splitExternalLocalOriginErrors"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a nice advantage of this passive healthcheck config is we can get more granular failure details like this vs. our existing active http healthcheck configuration, I would err on the side of always enabling this to separate server errors from possible network level errors, but curious to see what others think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, I would lean on the side of enabling this by default. Regarding granularity, we should also allow users to tune consecutive_gateway_failure
and consecutive_local_origin_failure
; otherwise, this toggle by itself feels too rigid.
We should also make sure to set sane defaults and that the accompanying documentation is clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for your review.The setting of consecutive_local_origin_failure
is already in the design document. Consecutive_gateway_failure was not intended to be configured at the beginning. Of course, if this is a good design, I will update this design document to support consecutive_gateway_failure
.
@clayton-gonsalves @sunjayBhatia
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my 2 cents here is to remove the SplitExternalLocalOriginErrors
field and set it to always true in the generated Envoy config
in addition, we should choose one of the 5xx error codes or gateway errors to expose as a configurable parameter to users for upstream originated error control, my 2 cents here again is that the consecutive_5xx
(existing ConsecutiveServerErrors
field in this config) is sufficient, I don't see too much utility in limiting the classified errors to 502, 503, and 504
The Contour project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to the #contour channel in the Kubernetes Slack |
Soon I will submit a PR with detailed code implementation. |
cc @clayton-gonsalves that would also interest us as well, do you want to do a review (or reassign it internally)? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this Design; I am looking forward to this.
Here are a few comments from my side and a question;
I like the granularity of having the outlier detection per service as it allows the service owners to fine-tune the as per their requirements.
Did you also explore the possibility of defining a config at the vhost level or the global level?
This would also align with the patterns we have in place for rate limiting and extauth for example.
I can envision a scenario where a cluster manager/admin sets these base policies at the cluster level, and advanced service owners can tune these as per their needs if required.
// Defaults to false. | ||
// +optional | ||
// +kubebuilder:default=false | ||
SplitExternalLocalOriginErrors bool `json:"splitExternalLocalOriginErrors"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, I would lean on the side of enabling this by default. Regarding granularity, we should also allow users to tune consecutive_gateway_failure
and consecutive_local_origin_failure
; otherwise, this toggle by itself feels too rigid.
We should also make sure to set sane defaults and that the accompanying documentation is clear.
The global OutlierDetection configuration supports configuring the log path of abnormal events. I have not found the related configuration of vhost level and global level in the relevant documents of envoy. |
Apologies for the bad link. Yeah, I don't think envoy supports it. We would need to implement it ourselves like how global auth is implemented. |
Yes, it is possible to implement it the same way using global authentication. this is a good suggestion. |
The Contour project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to the #contour channel in the Kubernetes Slack |
Signed-off-by: yy <[email protected]> outlier-detection-design.md Signed-off-by: yy <[email protected]> update outlier-detection-design.md Signed-off-by: yy <[email protected]> update design Signed-off-by: yy <[email protected]>
Signed-off-by: yy <[email protected]>
98cc6dc
to
c579c38
Compare
reopening this to restart some conversation shere |
// Defaults to false. | ||
// +optional | ||
// +kubebuilder:default=false | ||
SplitExternalLocalOriginErrors bool `json:"splitExternalLocalOriginErrors"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my 2 cents here is to remove the SplitExternalLocalOriginErrors
field and set it to always true in the generated Envoy config
in addition, we should choose one of the 5xx error codes or gateway errors to expose as a configurable parameter to users for upstream originated error control, my 2 cents here again is that the consecutive_5xx
(existing ConsecutiveServerErrors
field in this config) is sufficient, I don't see too much utility in limiting the classified errors to 502, 503, and 504
// Defaults to 0s. | ||
// +optional | ||
// +kubebuilder:validation:Pattern=`^(((\d*(\.\d*)?s)|(\d*(\.\d*)?ms))+)$` | ||
MaxEjectionTimeJitter *string `json:"maxEjectionTimeJitter,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can just set a default for this and not expose this field to start with, this config is pretty big so limiting what users need to think about would be great
|
||
|
||
## Open Issues | ||
- Whether the existing supported configuration options meet the needs of the user. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious for people who have used this in the past, particularly at scale, whether the "consecutive errors" mechanism for configuring outlier detection is actually desirable vs. the failure percentage mechanism
one could foresee a scenario in which a host in a cluster should be detected as an outlier however it does not consistently error consecutively but does still have a high failure percentage (say the consecutive errors was configured at 5, and you consistently get 3 out of 5 requests that fail), you may not detect that such a host should be ejected as quickly and also may not consistently eject it for as long as it needs to be ejected: https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/outlier#ejection-algorithm
enforcing_consecutive_gateway_failure: 0 | ||
``` | ||
|
||
Notes: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should probably add some discussion/documentation of the successful_active_health_check_uneject_host
field, users should be advised that this defaults to true so if active health checking is also enabled this could cause some hosts flip-floopping, esp. if the active health check is configured to a different endpoint than the exposed route
- If consecutiveServerErrors is specified as 0, and splitExternalLocalOriginErrors is true, then local errors will be ignored.This is especially useful when the upstream service explicitly returns a 5xx for some requests and you want to ignore those responses from upstream service while determining the outlier detection status of a host. | ||
- When accessing the upstream host through an opaque TCP connection, connection timeouts, connection errors, and request failures are all considered 5xx errors, and therefore these events are included in the 5xx error statistics. | ||
- Please refer to the [official documentation of Envoy][1] for more instructions. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should also include some discussion/documentation on the different behavior between http router and tcp filter when this is in the mix
The Contour project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to the #contour channel in the Kubernetes Slack |
The Contour project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to the #contour channel in the Kubernetes Slack |
The Contour project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to the #contour channel in the Kubernetes Slack |
The Contour project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to the #contour channel in the Kubernetes Slack |
The Contour project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to the #contour channel in the Kubernetes Slack |
The Contour project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to the #contour channel in the Kubernetes Slack |
The Contour project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to the #contour channel in the Kubernetes Slack |
The Contour project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to the #contour channel in the Kubernetes Slack |
The Contour project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to the #contour channel in the Kubernetes Slack |
adding passive health checks to services to enhance routing reliability.
close #5317